{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 00: Spark Scala Tutorial\n",
    "\n",
    "![](http://spark.apache.org/docs/latest/img/spark-logo-hd.png)\n",
    "\n",
    "Dean Wampler, Ph.D.<br/>\n",
    "[Lightbend](http://lightbend.com)<br/>\n",
    "[dean.wampler@lightbend.com](mailto:dean.wampler@lightbend.com)<br/>\n",
    "[@deanwampler](https://twitter.com/deanwampler)\n",
    "\n",
    "This tutorial demonstrates how to write and run [Apache Spark](http://spark.apache.org) applications using Scala with some SQL. I also teach a little Scala as we go, but if you already know Spark and you are more interested in learning just enough Scala for Spark programming, see my other tutorial [Just Enough Scala for Spark](https://github.com/deanwampler/JustEnoughScalaForSpark).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "This top-level notebook will guide you through the tutorial. But first, let's discuss [Apache Spark](https://spark.apache.org), a distributed computing system written in Scala for distributed data programming. Besides Scala, you can program Spark using Java, Python, R, and SQL! This tutorial focuses on Scala and SQL.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Spark Streaming and SQL\n",
    "\n",
    "Spark includes support for stream processing, using an older [DStream](https://spark.apache.org/docs/latest/streaming-programming-guide.html) or a newer [Structured Streaming](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) backend, as well as more traditional batch-mode applications.\n",
    "\n",
    "> **Note:** The streaming examples in this tutorial use the older library. Newer examples are TODO.\n",
    "\n",
    "There is a [SQL](http://spark.apache.org/docs/latest/sql-programming-guide.html) module for working with data sets through SQL queries or a SQL-like API. It integrates the core Spark API with embedded SQL queries with defined schemas. It also offers [Hive](http://hive.apache.org) integration so you can query existing Hive tables, even create and delete them. Finally, it supports a variety of file formats, including CSV, JSON, Parquet, ORC, etc.\n",
    "\n",
    "There is also an interactive shell, which is an enhanced version of the Scala REPL (read, eval, print loop shell)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Why Spark?\n",
    "\n",
    "By 2013, it became increasingly clear that a successor was needed for the venerable [Hadoop MapReduce](http://wiki.apache.org/hadoop/MapReduce) compute engine. MapReduce applications are difficult to write, but more importantly, MapReduce has significant performance limitations and it can't support event-streaming (\"real-time\") scenarios.\n",
    "\n",
    "Spark was seen as the best, general-purpose alternative, so all the major Hadoop vendors announced support for it in their distributions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Spark Clusters\n",
    "\n",
    "Let's briefly discuss the anatomy of a Spark cluster, adapting [this discussion (and diagram) from the Spark documentation](http://spark.apache.org/docs/latest/cluster-overview.html). Consider the following diagram:\n",
    "\n",
    "![](http://spark.apache.org/docs/latest/img/cluster-overview.png)\n",
    "\n",
    "Each program we'll write is a *Driver Program*. It uses a [SparkContext](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext) to communicate with the *Cluster Manager*, which is an abstraction over Hadoop YARN, Mesos, standalone (static cluster) mode, EC2, and local mode.\n",
    "\n",
    "The *Cluster Manager* allocates resources. An *Executor* JVM process is created on each worker node per client application. It manages local resources, such as the cache (see below) and it runs tasks, which are provided by your program in the form of Java jar files or Python scripts.\n",
    "\n",
    "Because each application has its own executor process per node, applications can't share data through the *Spark Context*. External storage has to be used (e.g., the file system, a database, a message queue, etc.)"
   ]
  },
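  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of the driver side of this picture, here is a minimal sketch of a driver program that obtains its `SparkContext`. It assumes local mode (`local[*]`), and the application name is a made-up placeholder. In a notebook or the Spark shell a session usually already exists, in which case `getOrCreate()` returns it and the `master` setting here is not applied.\n",
    "\n",
    "```scala\n",
    "import org.apache.spark.sql.SparkSession\n",
    "\n",
    "// Assumption: \"local[*]\" selects local mode with one worker thread per core.\n",
    "// On a real cluster the master is normally supplied externally instead.\n",
    "val spark = SparkSession.builder.\n",
    "  master(\"local[*]\").\n",
    "  appName(\"Cluster Anatomy Sketch\").   // hypothetical application name\n",
    "  getOrCreate()\n",
    "\n",
    "// The SparkContext is the driver's connection to the cluster manager.\n",
    "val sc = spark.sparkContext\n",
    "println(s\"master = ${sc.master}, application = ${sc.appName}\")\n",
    "```\n",
    "\n",
    "Only the master setting changes when the same program moves from local mode to a cluster; the rest of the driver code stays the same."
   ]
  },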
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Resilient, Distributed Datasets\n",
    "\n",
    "![Three RDDs Partitioned Across a Cluster of Four Nodes](https://github.com/deanwampler/spark-scala-tutorial/blob/master/images/RDD.jpg?raw=true)\n",
    "\n",
    "The data caching is one of the key reasons that Spark's performance is considerably better than the performance of MapReduce. Spark stores the data for the job in *Resilient, Distributed Datasets* (RDDs), where a logical data set is partitioned over the cluster.\n",
    "\n",
    "The user can specify that data in an RDD should be cached in memory for subsequent reuse. In contrast, MapReduce has no such mechanism, so a complex job requiring a sequence of MapReduce jobs will be penalized by a complete flush to disk of intermediate data, followed by a subsequent reloading into memory by the next job.\n",
    "\n",
    "RDDs support common data operations, such as *map*, *flatmap*, *filter*, *fold/reduce*, and *groupby*. RDDs are resilient in the sense that if a \"partition\" of data is lost on one node, it can be reconstructed from the original source without having to start the whole job over again.\n",
    "\n",
    "The architecture of RDDs is described in the research paper [Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing](https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf)."
   ]
  },
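  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The RDD operations and caching described above can be sketched as follows. This is a minimal illustration, not code from the example notebooks: it assumes local mode, and the two input lines and the application name are made up.\n",
    "\n",
    "```scala\n",
    "import org.apache.spark.sql.SparkSession\n",
    "\n",
    "val spark = SparkSession.builder.\n",
    "  master(\"local[*]\").\n",
    "  appName(\"RDD Sketch\").   // hypothetical application name\n",
    "  getOrCreate()\n",
    "val sc = spark.sparkContext\n",
    "\n",
    "// A tiny, made-up data set, split into two partitions.\n",
    "val lines = sc.parallelize(Seq(\"to be or not to be\", \"that is the question\"), 2)\n",
    "\n",
    "// flatMap and filter are lazy transformations; nothing is computed yet.\n",
    "val words = lines.flatMap(line => line.split(\" \")).filter(word => word.nonEmpty)\n",
    "\n",
    "// Ask Spark to keep the partitions in memory once they are first computed.\n",
    "words.cache()\n",
    "\n",
    "// Actions trigger computation. The second action can reuse the cached data.\n",
    "val total = words.count()\n",
    "val counts = words.map(word => (word, 1)).reduceByKey(_ + _).collect().toMap\n",
    "\n",
    "println(s\"total words = $total, distinct words = ${counts.size}\")\n",
    "```\n",
    "\n",
    "Without the `cache()` call, each action would recompute `words` from `lines`, which is harmless here but costly when the source is a large file."
   ]
  },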
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### SparkSQL\n",
    "\n",
    "[SparkSQL](http://spark.apache.org/docs/latest/sql-programming-guide.html) first introduced a new `DataFrame` type that wraps RDDs with schema information and the ability to run SQL queries on them. A successor called `Dataset` removes some of the type safety \"holes\" in the `DataFrame` API, although that API is still available.\n",
    "\n",
    "There is an integration with [Hive](http://hive.apache.org), the original SQL tool for Hadoop, which lets you not only query Hive tables, but run DDL statements too. There is convenient support for reading and writing various formats like [Parquet](http://parquet.io) and JSON."
   ]
  },
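  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the difference between the two APIs more concrete, here is a small sketch. The `Verse` case class, the sample rows, the view name `verses`, and the application name are all hypothetical and chosen only for illustration. It assumes local mode and an interactive setting (notebook or Spark shell), where a case class can be defined alongside the code that uses it.\n",
    "\n",
    "```scala\n",
    "import org.apache.spark.sql.SparkSession\n",
    "\n",
    "// Hypothetical record type, used only for this illustration.\n",
    "case class Verse(book: String, chapter: Int, text: String)\n",
    "\n",
    "val spark = SparkSession.builder.\n",
    "  master(\"local[*]\").\n",
    "  appName(\"SparkSQL Sketch\").   // hypothetical application name\n",
    "  getOrCreate()\n",
    "import spark.implicits._   // encoders plus the toDS and toDF conversions\n",
    "\n",
    "// Sample rows entered by hand; the result is a Dataset[Verse].\n",
    "val verses = Seq(\n",
    "  Verse(\"Gen\", 1, \"In the beginning\"),\n",
    "  Verse(\"Exo\", 1, \"Now these are the names\")).toDS()\n",
    "\n",
    "// Typed Dataset API: the compiler checks that Verse has a field named book.\n",
    "val genesis = verses.filter(verse => verse.book == \"Gen\")\n",
    "\n",
    "// DataFrame API and SQL: column names are checked only at run time.\n",
    "val df = verses.toDF()\n",
    "df.createOrReplaceTempView(\"verses\")\n",
    "val chapters = spark.sql(\"SELECT book, chapter FROM verses WHERE chapter = 1\")\n",
    "\n",
    "genesis.show()\n",
    "chapters.show()\n",
    "```\n",
    "\n",
    "A misspelled field in the typed `filter` fails at compile time, while a misspelled column in the SQL string fails only when the query is analyzed at run time."
   ]
  },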
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The Spark Version\n",
    "\n",
    "This tutorial uses Spark 2.2.0.\n",
    "\n",
    "The following documentation links provide more information about Spark:\n",
    "\n",
    "* [Documentation](http://spark.apache.org/docs/latest/).\n",
    "* [Scaladocs API](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package).\n",
    "\n",
    "The [Documentation](http://spark.apache.org/docs/latest/) includes a getting-started guide and overviews of the various major components. You'll find the [Scaladocs API](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package) useful for the tutorial."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The Examples and Exercises\n",
    "\n",
    "Here is a list of the examples, some of which have exercises embedded as comments. Click each link to navigate to the corresopnding notebook. Note that each name ends with a number, indicating the order in which you should study them:\n",
    "\n",
    "| Example Notebook | Description |\n",
    "| :--------------- | :---------- | \n",
    "| <a href=\"01_Intro.ipynb\" target=\"01_I\">01_Intro</a> | The first example demonstrates several features of typical Spark jobs. The fact that it's easy to embed this code in a notebook demonstrates that it's easy to work with Spark interactively. |\n",
    "| <a href=\"02_WordCount.ipynb\" target=\"02_WC\">02_WordCount</a> | The *Word Count* algorithm: read a corpus of documents, tokenize it into words, and count the occurrences of all the words. A classic, simple algorithm used to learn many Big Data APIs. By default, it uses a file containing the King James Version (KJV) of the Bible. (The `data` directory has a [README](data/README.html) that discusses the sources of the data files.) |\n",
    "| <a href=\"03_WordCount.ipynb\" target=\"03_WC\">03_WordCount</a> | An alternative implementation of *Word Count* that uses a slightly different approach and also uses a library to handle input command-line arguments, demonstrating some idiomatic (but fairly advanced) Scala code. |\n",
    "| <a href=\"04_Matrix.ipynb\" target=\"04_M\">04_Matrix</a> | Demonstrates using explicit parallelism on a simplistic Matrix application. |\n",
    "| <a href=\"05a_Crawl.ipynb\" target=\"05a_C\">05a_Crawl</a> | Simulates a web crawler that builds an index of documents to words, the first step for computing the *inverse index* used by search engines. The documents \"crawled\" are sample emails from the Enron email dataset, each of which has been classified already as SPAM or HAM. |\n",
    "| <a href=\"05b_InvertedIndex.ipynb\" target=\"05b_I\">05b_InvertedIndex</a> | Using the crawl data, compute the index of words to documents (emails). |\n",
    "| <a href=\"06_NGrams.ipynb\" target=\"06_N\">06_NGrams</a> | Find all N-word (\"NGram\") occurrences matching a pattern. In this case, the default is the 4-word phrases in the King James Version of the Bible of the form `% love % %`, where the `%` are wild cards. In other words, all 4-grams are found with `love` as the second word. The `%` are conveniences; the NGram Phrase can also be a regular expression, e.g., `% hated? % %` finds all the phrases with  `hate` and `hated`. |\n",
    "| <a href=\"07_Joins.ipynb\" target=\"07_J\">07_Joins</a> | Spark supports SQL-style joins as shown in this simple example. Note this RDD approach is obsolete; use the SparkSQL alternatives. |\n",
    "| <a href=\"08_SparkSQL.ipynb\" target=\"08_S\">08_SparkSQL</a> | Uses the SQL API to run basic queries over structured data in `DataFrames`, in this case, the same King James Version (KJV) of the Bible used in the previous tutorial. There is also a script version of this file. Using the _spark-shell_ to do SQL queries can be very convenient! |\n",
    "| <a href=\"09_SparkSQL-File-Formats.ipynb\" target=\"09_S\">09_SparkSQL-File-Formats</a> | Demonstrates writing and reading [Parquet](http://parquet.io)-formatted data, namely the data written in the previous example. |\n",
    "| <a href=\"10_SparkStreaming.ipynb\" target=\"10_S\">10_SparkStreaming</a> | The older _structured streaming_ (`DStream`) capability. Here it's used to construct a simplistic \"echo\" server. Running it is a little more involved, as discussed below. |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Going Forward from Here\n",
    "\n",
    "To learn more, see the following resources:\n",
    "\n",
    "* [Lightbend's Fast Data Platform](http://lightbend.com/fast-data-platform) - a curated, fully-supported distribution of open-source streaming and microservice tools, like Spark, Kafka, HDFS, Akka Streams, etc.\n",
    "* The Apache Spark [website](http://spark.apache.org/).\n",
    "* [Talks from the Spark Summit conferences](http://spark-summit.org).\n",
    "* [Learning Spark](http://shop.oreilly.com/product/0636920028512.do), an excellent introduction from O'Reilly, if now a bit dated.\n",
    "\n",
    "## Final Thoughts\n",
    "\n",
    "Thank you for working through this tutorial. Feedback and pull requests are welcome.\n",
    "\n",
    "[Dean Wampler](mailto:dean.wampler@lightbend.com)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Apache Toree - Scala",
   "language": "scala",
   "name": "apache_toree_scala"
  },
  "language_info": {
   "codemirror_mode": "text/x-scala",
   "file_extension": ".scala",
   "mimetype": "text/x-scala",
   "name": "scala",
   "pygments_lexer": "scala",
   "version": "2.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
